RNA-seq Data Analysis
1 The Data
This exercise uses RNA-seq data from Fu et al. 2015, Nature Cell Biology. The data compare gene expression in luminal cells of pregnant versus lactating mice. You can download the data from this Zenodo repository, and this Galaxy tutorial offers some ideas on how to get started.
The dataset consists of three CSV files:
- `heatmap_genes.csv`: A list of interesting genes to analyze (used in Figure 6b of the paper)
- `DE_results.csv`: Differential expression results, including logFC, AveExpr, t-statistic, and p-value for all genes
- `normalized_counts.csv`: Normalized counts for genes across different samples
To simplify the analysis you can consider the following:
- Read the CSV files using `read_csv()`
- Clean column names with `janitor::clean_names()` to make them easier to work with
- Combine relevant data frames using `left_join()` (`DE_results` and `normalized_counts` share a common column)
- Use `select()` to remove columns you don’t need for the analysis, which gives a better overview
- Keep only significant genes (this tutorial defines them as `p_value < 0.01 & abs(logFC) > 0.58`)
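The steps above could be sketched roughly as follows. Two tiny made-up tables stand in for `DE_results.csv` and `normalized_counts.csv` so the snippet runs on its own. The column names (`SYMBOL`, `P.Value`, the sample columns) and the choice of `symbol` as the shared key are assumptions, so check them against the real files. Note that `clean_names()` turns `logFC` into `log_fc`, which is why the filter uses that name.

```r
library(dplyr)
library(janitor)

# Toy stand-ins for the real files (all values are made up for illustration).
# With the downloaded data you would use, for example:
#   de_raw <- readr::read_csv("DE_results.csv")
de_raw <- tibble::tibble(
  SYMBOL  = c("GeneA", "GeneB", "GeneC", "GeneD"),
  logFC   = c(2.1, -1.4, 0.2, 0.9),
  AveExpr = c(5.0, 6.2, 4.1, 3.3),
  t       = c(9.5, -7.1, 0.8, 2.0),
  P.Value = c(1e-6, 1e-4, 0.45, 0.03)
)
counts_raw <- tibble::tibble(
  SYMBOL = c("GeneA", "GeneB", "GeneC", "GeneD"),
  Preg_1 = c(3.1, 7.0, 4.0, 3.0),
  Preg_2 = c(3.3, 6.8, 4.2, 3.1),
  Lact_1 = c(6.9, 5.5, 4.1, 3.6),
  Lact_2 = c(7.1, 5.3, 4.3, 3.8)
)

# Convert all column names to snake_case (logFC -> log_fc, P.Value -> p_value)
de     <- clean_names(de_raw)
counts <- clean_names(counts_raw)

# Join on the shared column (assumed to be `symbol`) and drop unused columns
joined <- de %>%
  left_join(counts, by = "symbol") %>%
  select(-ave_expr, -t)

# Keep only genes that pass both thresholds
significant <- joined %>%
  filter(p_value < 0.01, abs(log_fc) > 0.58)

significant
```

In this toy table only `GeneA` and `GeneB` pass both thresholds.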
2 Questions
What are the top 20 most significantly differentially expressed genes?
- Tip: Filter the genes based on p-value and log fold change, then order by p-value.
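A minimal sketch of this tip, assuming a DE table with cleaned column names `symbol`, `log_fc` and `p_value` (hypothetical names; the values below are made up so the example runs without the real files):

```r
library(dplyr)

# Made-up DE table: 50 genes with decreasing p-values and a repeating
# pattern of fold changes (illustration only, not real results)
de <- tibble::tibble(
  symbol  = paste0("Gene", 1:50),
  log_fc  = rep(c(1.5, -2, 0.3, 1, -0.8), times = 10),
  p_value = (50:1) * 1e-4
)

top20 <- de %>%
  filter(p_value < 0.01, abs(log_fc) > 0.58) %>%  # significance thresholds
  arrange(p_value) %>%                            # most significant first
  head(20)                                        # keep the first 20 rows

top20
```

`slice_min(p_value, n = 20)` would be an alternative to `arrange()` plus `head()`, but it keeps ties by default, so it can return more than 20 rows.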
Create a heatmap of the top 20 most significant genes. How do the expression patterns differ between pregnant and lactating samples?
- Check out the `pheatmap` package for creating heatmaps.
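One possible way to get from a data frame to a heatmap is sketched below. The expression values are randomly generated and the sample names (`preg_1`, `lact_1`, ...) are assumptions. With the real data you would start from the normalized counts of your top 20 genes instead.

```r
library(pheatmap)

# Made-up normalized counts: one row per gene, one column per sample
set.seed(42)
expr <- data.frame(
  symbol = paste0("Gene", 1:20),
  preg_1 = rnorm(20, mean = 5), preg_2 = rnorm(20, mean = 5),
  lact_1 = rnorm(20, mean = 7), lact_2 = rnorm(20, mean = 7)
)

# pheatmap() needs a numeric matrix; gene names go into the row names
mat <- as.matrix(expr[, -1])
rownames(mat) <- expr$symbol

# Column annotation: its row names must match the matrix column names
annotation <- data.frame(
  condition = c("pregnant", "pregnant", "lactating", "lactating"),
  row.names = colnames(mat)
)

# scale = "row" converts each gene to z-scores, so genes with very
# different expression levels can be compared in one colour scale
hm <- pheatmap(mat, scale = "row", annotation_col = annotation)
```

Row scaling is a design choice: without it, a few highly expressed genes tend to dominate the colour range and hide the pregnant versus lactating pattern.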
Create a volcano plot to visualize the differential expression results. How many genes are significantly up-regulated or down-regulated?
- First create a simple volcano plot
- Then you can try to label the 10 most significant genes in the volcano plot
- Check my R script and/or this tutorial on how to create a volcano plot with ggplot
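As a rough sketch (not the script referenced above), a labelled volcano plot could look like this. The DE table is simulated, and the column names `symbol`, `log_fc` and `p_value` are assumed to be what `clean_names()` produces for the real file.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Simulated DE results (illustration only, not real data)
set.seed(7)
de <- tibble::tibble(
  symbol  = paste0("Gene", 1:200),
  log_fc  = rnorm(200, sd = 1.5),
  p_value = runif(200)^3
)

# Classify every gene using the thresholds from this tutorial
de <- de %>%
  mutate(status = case_when(
    p_value < 0.01 & log_fc >  0.58 ~ "up",
    p_value < 0.01 & log_fc < -0.58 ~ "down",
    TRUE                            ~ "not significant"
  ))

# Number of up- and down-regulated genes
count(de, status)

# The 10 most significant genes, used only for labelling
top10 <- de %>%
  filter(status != "not significant") %>%
  arrange(p_value) %>%
  head(10)

volcano <- ggplot(de, aes(x = log_fc, y = -log10(p_value), colour = status)) +
  geom_point(alpha = 0.6) +
  geom_vline(xintercept = c(-0.58, 0.58), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_text_repel(data = top10, aes(label = symbol), show.legend = FALSE) +
  scale_colour_manual(values = c(
    "up" = "firebrick", "down" = "steelblue", "not significant" = "grey70"
  )) +
  labs(x = "logFC", y = "-log10(p-value)")

volcano
```

Passing only `top10` to `geom_text_repel()` keeps the plot readable, because labelling every point would be both slow and unreadable.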
Perform a Principal Component Analysis (PCA) on the top 20 significant genes. What can you conclude about the separation of pregnant and lactating samples?
- Tip: Use `prcomp()` for PCA and `factoextra::fviz_pca_ind()` for visualization.
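A minimal sketch of this tip, using a randomly generated gene-by-sample matrix in place of the real top 20 genes (the sample names and group labels are assumptions). The main point is that `prcomp()` treats rows as observations, so the matrix has to be transposed to get one point per sample.

```r
library(factoextra)

# Made-up expression matrix: 20 genes (rows) x 4 samples (columns)
set.seed(42)
mat <- cbind(
  preg_1 = rnorm(20, mean = 5), preg_2 = rnorm(20, mean = 5),
  lact_1 = rnorm(20, mean = 7), lact_2 = rnorm(20, mean = 7)
)
rownames(mat) <- paste0("Gene", 1:20)

# Transpose so that samples are rows; center and scale each gene.
# Rows with missing values would have to be dropped first (e.g. drop_na()).
pca <- prcomp(t(mat), center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained per component

# Group label for each sample, in the same order as the rows of t(mat)
condition <- factor(c("pregnant", "pregnant", "lactating", "lactating"))

pca_plot <- fviz_pca_ind(pca, habillage = condition, repel = TRUE)
pca_plot
```

If the two conditions separate along the first component, that component captures most of the pregnant versus lactating difference. Keep in mind that genes pre-selected for differential expression will almost always separate the groups.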
3 Useful Functions
To complete these tasks, you may find the following R functions and packages helpful:
- Data manipulation:
  - `janitor::clean_names()`: Clean column names
  - `left_join()`: Combine data frames
- Statistical analysis:
  - `scale()`: Scale numeric variables
  - `prcomp()`: Perform Principal Component Analysis
- Visualization:
  - `pheatmap::pheatmap()`: Create heatmaps (needs a matrix, so you need to reshape your data into a matrix first)
  - `ggrepel::geom_text_repel()`: Add non-overlapping labels to plots
  - `factoextra::fviz_pca_ind()`: Visualize PCA results
- Other useful functions:
  - `as.matrix()`: Convert data frames to matrices
  - `drop_na()`: Drop rows with missing values (may be needed for PCA because it doesn’t handle missing values)